Feature/client daemon refactoring#256

Merged
joelteply merged 20 commits into main from feature/client-daemon-refactoring
Jan 13, 2026

Conversation

@joelteply
Contributor

Summary

Brief description of changes and why they're needed

Change Type & Scale

  • 🐛 Bug fix (fixes an issue)
  • ✨ New feature (adds functionality)
  • 📚 Documentation (README, guides, comments)
  • 🔧 Configuration (ESLint, CI, build tools)
  • 🧹 Cleanup (refactor, remove dead code)
  • 💥 Breaking change (existing functionality changes)

Scale:

  • 📏 Small (<50 files changed)
  • 📐 Medium (50-200 files changed)
  • 📊 Large integration (200+ files changed)

Testing & Verification

  • ✅ Local testing completed
  • 🧪 npm run lint passes
  • 🔍 npm test passes
  • 📸 Screenshot/visual verification (if UI changes)
  • 🤖 Portal commands tested: python python-client/ai-portal.py --cmd tests

AI Development Notes

  • 🔄 Maintains backward compatibility with existing AI agents
  • 📋 Updated CLAUDE.md if process changes
  • 🎯 Follows modular architecture principles
  • 🚨 Emergency verification system still works

Status & Readiness

  • 🟢 Ready to merge (all checks pass, no known issues)
  • 🟡 Merge with caution (some issues documented below)
  • 🔴 Do not merge yet (major issues need resolution)

Known Issues: (if any)

Files Changed

List key files and why they changed

Related Issues

Fixes #(issue) or Relates to #(issue)


For AI Agents: Use python python-client/ai-portal.py --dashboard to verify system health after merging

Joel and others added 19 commits January 10, 2026 22:56
- Create worker_pool.rs with multi-instance model loading
- Each worker has own QuantizedModelState + Metal GPU device
- Request channel distributes work via tokio mpsc
- Semaphore tracks available workers for backpressure
- Auto-detect worker count based on system memory (~2GB per worker)
- Update InferenceCoordinator: maxConcurrent 1→3, reduced cooldowns
- Fallback to single BF16 instance when LoRA adapters requested

Before: 1 request/~6s + 30s timeout cascade
After: 4 requests/~6s in parallel, no timeouts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
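The backpressure pattern this commit describes (a semaphore tracking available workers, with a channel distributing requests) comes from the Rust/tokio worker pool. A minimal TypeScript analog, purely for illustration — the class and function names here are assumptions, not the actual worker_pool.rs API:

```typescript
// A counting semaphore gates requests so at most `permits` tasks run at
// once, mirroring the tokio::sync::Semaphore used by the worker pool.
class Semaphore {
  private waiters: Array<() => void> = [];
  constructor(private permits: number) {}

  async acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the permit directly to the next waiter
    else this.permits++;
  }
}

// Run a task on the pool: block until a worker slot is free, then
// always release the slot when the task settles.
async function runOnPool<T>(sem: Semaphore, task: () => Promise<T>): Promise<T> {
  await sem.acquire();
  try {
    return await task();
  } finally {
    sem.release();
  }
}
```

With 3 workers (maxConcurrent 1→3), a fourth request simply waits for a permit instead of timing out against a busy single instance.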
- Daemons start in dependency order: critical → integration → lightweight
- Critical path (data, command, events, session): 350ms max
- Integration daemons wait for DataDaemon before starting
- Lightweight daemons (health, widget, logger) start immediately
- Phase breakdown metrics logged for observability

Phases:
- critical: 4 daemons, max=207ms (UI can render)
- integration: 7 daemons, max=3518ms (AIProvider bottleneck)
- lightweight: 7 daemons, max=130ms

Total startup: 3531ms (critical path ready much sooner)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
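The wave-based startup above can be sketched as follows. This is a simplified illustration (in the real orchestrator, lightweight daemons start immediately rather than as a trailing wave), and the `startInWaves` name and daemon shape are assumptions:

```typescript
// Start daemons in dependency waves: every daemon in a wave starts in
// parallel, and the next wave begins only after the current one resolves.
type Daemon = { name: string; start: () => Promise<void> };

async function startInWaves(waves: Daemon[][]): Promise<Record<string, number>> {
  const phaseMs: Record<string, number> = {};
  for (const [i, wave] of waves.entries()) {
    const t0 = Date.now();
    await Promise.all(wave.map((d) => d.start()));
    // Phase breakdown logged for observability, as in the commit message.
    phaseMs[`wave${i}`] = Date.now() - t0;
  }
  return phaseMs;
}
```

The benefit is that the critical wave (data, command, events, session) gates UI rendering at ~207ms, while the slow integration wave (AIProvider at ~3.5s) no longer blocks it.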
- Fix SystemOrchestrator navigate command (remove invalid --path param)
- Fix launch-and-capture.ts: check ping, refresh if connected, open if not
- Fix SystemMetricsCollector countCommands to use ping instead of browser logs
- Add deterministic UUIDs for seeded users (Joel, Claude Code)
- Improve UserDaemonServer error logging for persona client creation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Browser launch:
- ALWAYS open browser window, don't just reload
- Ensures user sees something even if WebSocket connected but window closed
- Both SystemOrchestrator and launch-and-capture now open browser unconditionally

UUID fix:
- stringToUUID now generates valid 36-char UUIDs (was generating 32-char)
- Last segment now correctly 12 chars instead of 8

ChatWidget:
- Use server backend for $in queries (localStorage doesn't support $in)
- Add debug logging for member loading

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
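The UUID fix hinges on the canonical 8-4-4-4-12 layout: 32 hex chars plus 4 hyphens = 36 chars, with a 12-char final segment (the bug truncated it to 8). A hypothetical reconstruction — the FNV-1a-style hash here is illustrative only, not the project's actual hashing scheme:

```typescript
// Deterministically map a seed string (e.g. "Joel") to a valid
// 36-character UUID in 8-4-4-4-12 form. Assumes a non-empty input.
function stringToUUID(input: string): string {
  // Derive 32 hex chars from the input with a simple FNV-1a-style mix
  // (illustrative hash, NOT the real implementation).
  let hex = "";
  let h = 0x811c9dc5;
  for (let i = 0; hex.length < 32; i++) {
    h ^= input.charCodeAt(i % input.length) + i;
    h = Math.imul(h, 0x01000193) >>> 0;
    hex += h.toString(16).padStart(8, "0");
  }
  hex = hex.slice(0, 32);
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32), // 12 chars -- the segment the bug cut to 8
  ].join("-");
}
```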
- DataReadBrowserCommand now supports backend:'server' to bypass localStorage
- ChatWidget uses server backend for room queries to avoid stale cache
- Fixes issues with members not loading due to localStorage not supporting $in

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
CONSOLIDATION (Ministry of Code Deletion):
- RoutingService is now THE single source of truth for room/user resolution
- Added resolveRoomIdentifier() and resolveUserIdentifier() convenience functions
- Added name fallback query for legacy support
- Migrated ChatSendServerCommand: deleted findRoom(), uses RoutingService
- Migrated ChatAnalyzeServerCommand: deleted resolveRoom(), uses RoutingService
- Migrated ChatPollServerCommand: deleted inline resolution, uses RoutingService
- WallTypes.isRoomUUID() now delegates to RoutingService.isUUID()
- MainWidget: deleted dead handleTabClick/handleTabClose, simplified openContentTab

ETHOS (CLAUDE.md):
- Added "The Compression Principle" - one logical decision, one place
- Added "The Methodical Process" - 8 mandatory steps, outlier validation
- Encoded the Ministry philosophy: deletion without loss = compression = efficiency

Net change: +306/-298 lines (compression-neutral while adding documentation)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
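The consolidated resolution logic can be sketched like this. `resolveRoomIdentifier` and `isUUID` are named in the commit; the store shape and lookup mechanics are assumptions for illustration:

```typescript
// The one place room identifier resolution lives: accept either a UUID
// or a room name (legacy fallback) and return the canonical room id.
interface Room { id: string; name: string }

const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function isUUID(value: string): boolean {
  return UUID_RE.test(value);
}

function resolveRoomIdentifier(
  identifier: string,
  rooms: Room[]
): string | undefined {
  if (isUUID(identifier)) return identifier;
  // Name fallback query for legacy callers.
  return rooms.find((r) => r.name === identifier)?.id;
}
```

The point of the consolidation is that chat-send, chat-analyze, and chat-poll all delegate here instead of carrying their own `findRoom()`/`resolveRoom()` copies.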
Key fixes:
- data-clear.ts: Clear session metadata during reseed to prevent stale
  entityIds from persisting (root cause of corrupted UUID bug)
- MainWidget.ts: Add userId setup with retry in openContentFromUrl() and
  initializeContentTabs() to ensure ContentService can persist to database
- RoutingService.ts: Fix example UUID in comment
- SchemaBasedFactory.ts: Fix hardcoded test UUID

The corrupted UUID issue (5e71a0c8-0303-4eb8-a478-3a121248) was caused by
stale session metadata files that weren't cleared during data reseed.
The session files stored old entityIds that no longer existed after
reseeding the database.

ContentState persistence now works - tabs are saved to database with
correct UUIDs. Tab restore on refresh still needs investigation due to
session management timing issues.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: Browser's OfflineStorageAdapter caches user_states in
localStorage. When tabs are opened, the server database is updated,
but localStorage retains stale data. On page refresh, loadUserContext()
would get old cached data with fewer/no openItems.

Fix: Add `backend: 'server'` to user_states query in loadUserContext()
to bypass localStorage cache and always fetch fresh contentState from
the server database.

Also added debug logging (temporary) to help diagnose initialization
timing issues between loadUserContext() and initializeContentTabs().

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
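The backend-selection rule described in these two commits — explicit `backend: 'server'` bypasses the cache, and `$in` queries must go to the server because localStorage can't evaluate them — can be sketched as:

```typescript
// Decide where a data read executes. All names here are assumptions
// sketching the behaviour, not the actual DataReadBrowserCommand API.
type Backend = "local" | "server";

interface ReadOptions {
  filter: Record<string, unknown>;
  backend?: Backend; // explicit override, e.g. for fresh user_states
}

function chooseBackend(opts: ReadOptions): Backend {
  // Caller demanded fresh server state (bypass localStorage cache).
  if (opts.backend === "server") return "server";
  // The localStorage adapter cannot evaluate $in, so route those queries
  // to the server database.
  const usesIn = Object.values(opts.filter).some(
    (v) => typeof v === "object" && v !== null && "$in" in (v as object)
  );
  return usesIn ? "server" : "local";
}
```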
- inference-grpc: Fix dead code, use pool stats, proper strip_prefix
- data-daemon: Fix HDD acronym, add type alias for complex type
- inference: Collapse nested if-let
- model.rs: Use struct literal instead of removed new()

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: UserDaemon's initializeDeferred() tried to create persona
clients before JTAGSystem.daemons was populated. DataDaemon emits
system:ready during its initialize() phase, triggering UserDaemon's
ensurePersonaClients() which needs CommandDaemon. But CommandDaemon
wasn't yet registered to JTAGSystem.daemons (only happens AFTER
orchestrator.startAll() returns).

Fix:
1. CommandDaemonServer now registers itself to globalThis during its
   initialize() phase, providing early access for other daemons
2. JTAGSystem.getCommandsInterface() now checks globalThis first,
   falling back to this.daemons for compatibility

Also fixed Clippy duplicate_mod warning in training-worker:
- logger_client.rs now re-exports JTAG protocol types
- messages.rs uses re-exports instead of including jtag_protocol directly

Verified: All 11 AI personas now healthy and responding to messages.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Per-persona inference logging confirmed (Helper AI, Teacher AI, etc.)
- System utilities correctly show [unknown] in Rust logs
- AI responses verified working via Candle gRPC and cloud APIs
- Version bump 1.0.7184

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1. SignalDetector: Switch from local (slow) to Groq (fast)
   - Was flooding local inference queue with classification calls
   - Groq responds in <1s vs local ~10s
   - Frees local queue for actual persona responses

2. CandleGrpcAdapter: Add prompt truncation (24K char limit)
   - Prevents "narrow invalid args" tensor dimension errors
   - Large RAG contexts were sending 74000+ char prompts
   - Model has 8K token (~32K char) context window
   - Truncation preserves system prompt + recent messages

Before: Constant queue backlog, tensor errors, hangs
After: Workers have availability, no tensor errors, faster responses

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
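The truncation policy — keep the system prompt intact, drop the oldest messages first, stay under the 24K-char limit — can be sketched as below. The constant comes from the commit message; the function shape is an assumption:

```typescript
// Fit a prompt into the adapter's character budget: the system prompt is
// always preserved, then messages are kept newest-first until the budget
// is exhausted (so large RAG contexts shed their oldest content).
const MAX_PROMPT_CHARS = 24_000;

function truncatePrompt(
  system: string,
  messages: string[],
  limit = MAX_PROMPT_CHARS
): string {
  let budget = limit - system.length;
  const kept: string[] = [];
  // Walk from the most recent message backwards, keeping what fits.
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].length > budget) break;
    budget -= messages[i].length;
    kept.unshift(messages[i]);
  }
  return system + kept.join("");
}
```

A 74,000-char prompt against an ~32K-char context window is what produced the tensor-dimension errors; clamping at 24K leaves headroom for the response.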
The chat widget was unlatching from the bottom when large messages
arrived because the fixed 200px threshold was too small.

Changes:
- Add isLatchedToBottom state to track user intent
- Dynamic threshold: max of config, 50% viewport, or 500px
- ResizeObserver checks latch state instead of distance
- Scroll handler updates latch with tighter 100px threshold
- Scroll listener active when autoScroll enabled

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
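The latch logic above reduces to two pure functions (DOM wiring via ResizeObserver and scroll listeners omitted; names are illustrative):

```typescript
// Unlatch threshold grows with content size: the max of the configured
// value, 50% of the viewport, or 500px, so a single large message can't
// push the user "out of range" of the bottom.
function dynamicThreshold(configPx: number, viewportPx: number): number {
  return Math.max(configPx, viewportPx * 0.5, 500);
}

// The scroll handler uses a tighter 100px band to read user intent:
// latched only while the user is effectively at the bottom.
function isLatchedToBottom(
  scrollTop: number,
  scrollHeight: number,
  clientHeight: number,
  thresholdPx = 100
): boolean {
  return scrollHeight - (scrollTop + clientHeight) <= thresholdPx;
}
```

The key design change is that the ResizeObserver consults the stored latch state rather than re-measuring distance, so growth of the content itself can never flip the latch.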
The scrollToEnd was called immediately after adding items to DOM,
but the browser hadn't laid them out yet. Using double-rAF ensures
the DOM is fully rendered before scrolling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings January 13, 2026 23:16
@joelteply joelteply merged commit 5a438bb into main Jan 13, 2026
3 of 5 checks passed
@joelteply joelteply deleted the feature/client-daemon-refactoring branch January 13, 2026 23:18
Copilot AI (Contributor) left a comment


Pull request overview

This PR appears to be a major refactoring focused on daemon/worker infrastructure improvements, including memory management for inference workers, worker pool implementation for concurrent inference, improved process management scripts, and extensive cleanup of deprecated test/script files.

Changes:

  • Added memory limits and worker pool support for inference workers
  • Refactored Rust worker modules to avoid duplicate code and improve structure
  • Improved shell scripts for starting/stopping workers with better process tracking
  • Updated TypeScript widgets and system core for better content state management
  • Removed 40+ deprecated test and script files

Reviewed changes

Copilot reviewed 142 out of 144 changed files in this pull request and generated 1 comment.

Summary per file:

  • workers-config.json: Added memory limits configuration and per-worker memory settings
  • workers/training/src/messages.rs: Refactored to re-export protocol types from logger_client
  • workers/stop-workers.sh: Updated to use binary names from config for process termination
  • workers/start-workers.sh: Added memory limit parsing and per-worker log files
  • workers/shared/logger_client.rs: Added worker pool, GPU synchronization re-exports of JTAG protocol types
  • workers/inference-grpc/*: Added worker pool, GPU synchronization, persona tracking
  • widgets/*: Improved content state management and user identifier handling
  • system/core/*: Added DaemonOrchestrator for wave-based parallel startup
  • system/routing/*: Added room name lookup and server-side resolution functions
  • Multiple scripts/*: Deleted 40+ deprecated test/utility scripts
Files not reviewed (1)
  • src/debug/jtag/package-lock.json: Language not supported


```ts
  queuedMessages: daemon.startupQueueSize
});

const totalMs = endTime - startTime;
```

Copilot AI Jan 13, 2026


Unused variable `totalMs`.

